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Abstract: This paper studies sparse linear regression analysis with outliers in the 
responses. A parameter vector for modeling outliers is added to the standard linear 
regression model and then the sparse estimation problem for both coefficients and 
outliers is considered. The ii penalty is imposed for the coefficients, while various 
penalties including redescending type penalties are for the outliers. To solve the 
sparse estimation problem, we introduce an optimization algorithm. Under some 
conditions, we show the algorithmic and statistical convergence property for the 
coefficients obtained by the algorithm. Moreover, it is shown that the algorithm 
can recover the true support of the coefficients with probability going to one. 
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1. Introduction 

Linear regression with a large number of covariates is a general and funda¬ 
mental problem in recent data analysis. A standard method to overcome this 
problem is the least absolute shrinkage and selection operator (Lasso) proposed 


by Tibshirani (1996). For a large number of covariates, it is natural to assume 
the sparsity which means that many of the covariates are not relevant to the 
responses. The Lasso can draw relevant covariates automatically and simultane¬ 
ously estimate the remaining coefficients to be zero. From the theoretical point 


of view, the Lasso has been largely studied. For instance, Bickel et al. (2009), 


Meinshausen and Yu (2009) and Wainwright (2009) gave the convergence rate 


under several norms and showed that the Lasso estimates can recover the true 


support of the coefficients. See also Efron et al. (2004), Zhao and Yu 


Zou 


(2006) and van de Geer and Biihlmann (2009). 


Recent linear regression analysis requires some complex structures in addi- 
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tion to a large number of covariates. One of them is an outlier structure. This 
structure often appears in many applications such as signal detection, image and 
speech processing, communication network and so on. It is well known that the 
standard method which uses the £2 loss outputs inaccurate estimates when out¬ 
liers exist. The most popular way for robustifying against the outliers is to use 
the M-estimation procedure which replaces the I 2 loss by an other loss with a 
bounded influence function; e.g., the Huber’s, the skipped-mean and the Ham¬ 


pel’s robust loss (see, for instance, Huber and Ranchetti (2009) for more details). 
However, optimizing such a robust loss function with a sparse penalty requires 
much computational cost. The (.2 loss function is easier to deal with. Recently, 


She and Owen (2011) proposed a novel approach for robust parameter estimation. 


using an outlier model where a parameter vector for modeling outliers is added 
to the standard linear regression model. The corresponding £2 loss function with 
a sparse penalty for the outlier parameter vector is optimized. The connection 
between sparse penalties and robust loss functions was also illustrated. 

This paper studies sparse and robust linear regression based on their outlier 


model from a theoretical point of view, which was not treated in She and Owen 


(2011). Some theoretical analyses were discussed by Nguyen and Tran (2013), 
but only the penalty was used for the outlier parameters. Their results were 


derived from the fact that the resulting estimate is a global optimum. As She and 


Owen (2011) showed, many penalties having good robustness are non-convex and 


then the resulting estimate is often a local optimum. A general theory including 
non-convex penalties is thus of great interest. 

Main contributions in this paper are the following. We consider a larger 
class of penalties for outlier parameters including non-convex penalties and then 
derive some statistical properties. To avoid the problem of local optima, we 
directly analyze estimated coefficients which an optimization algorithm outputs. 
We provide the upper bound of its ^2 error, which is divided by an algorithmic 
error and a statistical error. It is also shown that the algorithm can recover 
the true support of the coefficients. Thus, this paper bridges a gap between 
the statistical theory and the computation algorithm, which arises in using non- 
convex penalties. 

The remainder of this paper is organized as follows. We introduce the model 
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and the optimization algorithm in Section 2. In Section 3, theoretical analyses for 
the estimated coefficients which the algorithm outputs are provided. We report 
numerical performances in Section 4. All the proofs are in Section 5. 

Throughout the paper, a bold symbol denotes a matrix or a vector and 
its element is written by the fine symbol, e.g., A = [atj) for A G ^nd 

a = (ai,... ,am)'^ for a G M™'. For any vector a G and 1 < (? < oo, we 
dehne the £q norms as ||a||q = and define the io and £oo norms as 

||tt||o = \{'i\cii / 0}| and ||a||oo = maxi<j<m |ai|, respectively. Given a set S' C 
{!,..., m}, as denotes the sub-vector having the elements of a corresponding 
to the set 5, that is, as = {ai\ i G S}. For any two vectors a,b £ M"*, (a, 6 ) 
denotes the standard inner product. We dehne the matrix £i norm as ||A||£^ = 
maxi<j<„ ^ |aij| and for any symmetric matrix B G we dehne its 

largest eigenvalue as ^max{B). For two positive sequences an, bn depending on 
n, the notation an = 0{bn) means that there exists a hnite constant G > 0 such 
that an < Cbn for a sufficiently large n, while = n(6n) means that an > Cbn- 
Also, the notation an = o(6„) means that anjbn —)• 0 as n goes to inhnity. 

2. Sparse and robust linear regression 


2.1. Model and parameter estimation 

Consider the linear regression model with outliers 


y = a:/3* + 


+ 


( 2 . 1 ) 


where y = {yi,... ,ynY' is an n dimensional response vector, X = (xjj) = 
(a?!,..., Xn)"^ is an n X p covariate matrix, (3* is a p dimensional unknown co¬ 
efficient vector, 7 * is an n dimensional unknown vector whose nonzero elements 
correspond to outliers and e is an n dimensional random error vector. In the 
model (2.1), we assume that the £2 norm of columns of X is ^/n. Correspond¬ 
ingly, the coefficient of 7 * is assumed to be y/n to match its scale with the 
columns of X. The model (2.1) can be found also in She and Owen (2011) and 


Nguyen and Tran (2013). Our purpose is to estimate £3* accurately even when 


the number of covariates is large. In this case, it is natural to assume that j3* 
has many zero elements (sparse). Moreover, we can assume that 7 * is also sparse 
since the number of outliers is usually not large. For this sparse structure, we 
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introduce sparse penalties for both coefficients and outliers. The parameters are 
estimated by solving the optimization problem 

p n 

” ( 2 . 2 ) 


1 


argmm , 

/3eiRp,7eK" 


\y- X(3- ^/n'y \\2 + 

j=l i=l 

where > 0 and A.y > 0 are tuning parameters for /3 and 7 , respectively and 
P(-) is a penalty function that encourages sparsity. We often use a redescending 
type of P(-), since it can yield a small bias against strong outliers. We consider 
the adaptive Lasso (Zou (2006)) type optimization problem. In (2.2), w/^^i and 
are known weights. Suppose that we have the preliminary estimators /3 = 
(/3i,..., and 7 = ( 71 ,... , jn)'^■ The weights are typically defined as wpj = 
l/\/3j\ and = 1/171. For the details of the preliminary estimators used in 
this paper, see Section 3.3. 


2.2. Optimization algorithm 


Algorithm 1 


Step 1. Initialize k <— 0, ^ jk ^ aigmm^L{P^,'y). 

Step 2 . Update k k + 1, 


^ argmin^L(/3,7"-'), 

(2.3) 

7^" ^ argmin..^L(,3'‘,7). 

(2.4) 

Step 3. If they converge, then output current (/3^,7^) and stop the algorithm, oth¬ 
erwise return to Step 2. 


Let L{P,w) be the objective function in (2.2). To solve the optimization 
problem ( |2.2[ ), we introduce Algorithmic This algorithm surely converges since 
it satisfies that 


L(/3°,7°) > F(/3\7°) > > • • • > 0 


from its construction. We also note that the optimization problem in (2.3) can 


be solved by a typical way such as the coordinate descent algorithm (Friedman 


et al. (2010)). The optimization problem in (2.4) can be rewritten as 


1 

argmin — V {(7 - - Vnjif + X^w^^iPi'ji)}. 

zn . , 


i=l 
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Suppose that the problem argmin 2 ,(z — x )^/2 + XP{x) has the explicit solution 
0(z;A). Then, (2.4) can be written as 


li 


! —©(l/i P j A-ylC/y j), i — 1, . . . 

wn 


n, 


h(/3"). 


where is the ith element of 7 ^. Let this expression be denoted by 7 ^ 

Then, the step 2 can be expressed with only the update of /3 as 

(3^ ^ argminL{/3,/i(/3^“^)}. 

(5&MP 

The function 0(z; A) is often called as the thresholding function. As seen 


m 


Antoniadis and Fan (2001), many sparse penalties, including the t'l, Iq and 


smoothly clipped absolute deviation (SCAD; Fan (1997)) penalties, have an ex¬ 
plicit solution 0 ( 2 ;; A). The ii penalty leads to the soft thresholding function 
0(z; A) = sgn(z) max(|z| — A, 0) where sgn(-) denotes the sign function, and the 
io penalty leads to the hard thresholding function 0 ( 2 :; A) = zl{\z\ > A), where 
I {A) denotes the indicator function on the event A. For the SCAD penalty, the 
corresponding thresholding function is given by 


0 ( 2 ; A) = < 


sgn( 2 )(| 2 | — A) if I 2 I < 2A 

if 2A < I 2 I < aX 
if I 2 I > oA, 


(a—1)2—aAsgn(.2) 
a—2 


where a = 3.7 is recommended by Fan and Li (2001). 


2.3. Connection to robust M-estimators 


Similarly in She and 0wen (2011), our procedure also has a close connection 
to robust M-estimators. 


Proposition 1 . Let /3 be the coefficient output of Algorithm^ and let 1 /^( 2 ; A) = 
2 — 0 ( 2 ; A), where 0(-; A) is a thresholding function yielded from a penalty function 
P{-). Then, for j = 1,... ,p, it follows that 

1 

- xJP] X^w^^i) + Xgwpjdlffil = 0, (2.5) 

i=l 

where d\j3j\ is the subgradient of\/3j\ which means that d\f3j\ = sgn(/3j) if 7 ^ 0, 
d\(3j\ G [—1,1] otherwise. 

















6 


SHOTA KATAYAMA AND HIRONORI FUJISAWA 


Proposition!^ shows that the output /3 satisfies the estimating equations (2.5) 
with the 'll) function and the £i penalty. Thus, our algorithm is closely related to 
the optimization problem 

^ n p 

argmin tw T] (2 .6) 

/3eMP 2n ^ 

where ^T(t;A) = i/^(t;A). It may take much computational cost to directly 


solve (2.6), particularly when the function T(.;A) is non-convex. For the fast 


computation, we can use the first-order approximation of 'h(-; A), but it loses the 
information of the original function. For this reason, we use Algorithm instead 


of solving (2.6). 


The relationship between sparse penalties and robust loss functions are stated 
through the equation Tp{z] X) = z — ©(z; A). For instance, the £i penalty corre¬ 
sponds to the Huber’s loss, the io penalty to the skipped-mean loss and the SCAD 


penalty to the Hampel’s loss. She and Owen (2011) gave their illustrations and 


more details. One measure to characterize the robust loss function is the re¬ 
descending property. It means that i/;(z;A) goes to 0, equivalently \z — ©( 2 :;A)| 
goes to 0, as | 2 ;| goes infinity. The soft thresholding function which corresponds 
to the £i penalty does not have the redescending property, but the hard and 
SCAD thresholding functions which correspond to the non-convex penalties have 
it. Some other non-convex penalties including the non-negative garrote penalty 
(Garrote; Gao (1998)) and the minimax concave penalty (MGP; [Zhang (2010)) 
also lead to that property. Thus, penalties yielding good robustness are non- 


convex. 


3. Theoretical Analysis 

The optimization problem (2.2) with a non-convex penalty P(-) suffers from 
local minima. We may be able to give a theoretical analysis for a global mini¬ 


mum of (2.2), but it is unclear which optimization algorithm can actually output 


the global minimum. To avoid such a problem, we directly analyze the output of 
Algorithmic Some theoretical analyses for computable solutions in linear regres¬ 


sion model without outliers were provided by Zhang and Zhang (2012), Fan and 


Lv (2013) and Fan and Lv (2014), but our analysis is essentially different from 


theirs. They hrst derived some properties for a global minimum and then showed 
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that a computable solution shared those properties under additional conditions 
for the solution. 


3.1. Notations 

First, we prepare some notations. Let S* = supp(/3*) = {z| (3* / 0} C 
{l,...,p} and G* = supp( 7 *) C {l,...,n}. These support sizes are written 
by s* = I (S'* I and g* = |G*|. For the preliminary estimators, similarly let S = 
supp(/3) and G = supp( 7 ) with sizes s = |5| and g = |( 5 |. We define the restricted 
smallest eigenvalue 5min as 


^min (^) 


inf 

||5||o<m 



(3.1) 


and the (doubly) restricted largest eigenvalue 5max as 




11 ^( 0^111 

sup sup -—Jo— 


(3.2) 


where ^(g) is the sub-matrix having the rows of X corresponding to the set G, 
that is, ^(G) = {xij\ i ^ G,1 < j < p}. The restricted eigenvalue is popular 
in the analysis of the Lasso (see, e.g., Bickel et al. (2009)), but our 6max is 
slightly different from the existing one. The restriction is imposed on rows of 
X in addition to columns, corresponding to the outliers. We shall provide an 
asymptotic analysis as n goes to infinity. In our theory, p, s* and g* may go to 
infinity depending on re. Notice that the restricted eigenvalues also depend on re. 


3.2. Properties of Algorithm output 

To derive a convergence property of Algorithm we require the following 
conditions for the random error vector s = (ei,... the thresholding func¬ 

tion 0(-; A) and the preliminary estimator (/3,7). 

Couditiou 1. The errors £i, ... ,£n are independently and identically distributed 
as the zero mean sub-Gaussian distribution with a parameter cr > 0, that is, 
E(ei) = 0 and E{exp(tei)} < exp(t^cj^/2) for all t G M. 

Couditiou 2. The thresholding function 0(-;A) satisfies that 0(x;A) = 0 if 
|x| < A and |0(x; A) — x| < A for all x G M. 
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Condition 3. There exist a sequence an,i —>• 0 and a constant k > 0 such that 
11/3 — / 3*||2 + II 7 — 7*||2 < CcLn,i and 6min{s) > K with probability going to one, 
where C is some positive constant. 


Condition assumes that the errors in the model (2.1) belong to the sub- 


Gaussian distribution family. As seen in Vershynin (2012), this family covers the 


Gaussian, Bernoulli and any distributions with a bounded support. Condition 
decides a class of thresholding functions (or equivalently penalties) which can 
be handled by our theory. For instance, this class includes the soft, hard, non¬ 
negative garrote, SCAD and MCP thresholding functions. The first half claim 
in Condition 1^ implies consistent preliminary estimators at the rate an,i- Under 
this condition, we have \j3j\ > 1/3^1 — |/3j — /3j| > l/3j| — C^n,! for j G S* by the 
triangle inequality. Hence, if min^g^* \ j3*\ > Can,i, then \j3j\ > 0 for j G S*. 
Similarly, we can show that if minjgG* Iq^l > Can,i then \^i\ > 0 for / G G*. 
Thus, Condition 1^ leads to the screening property that S* <Z S and G* C (5 if 


min I min 1/3*1, min | 7 *|| > Can i- 
jes* ^ i&G* ^ 


(3.3) 


The second half claim in Condition is well known as the sparse Riesz condition 


(Zhang and Huang (2008)) if s is non-random. When we use a preliminary 
estimator which has a non-random upper bound Su, it reduces to 5min{su) > 

In Section 3.3, we shall clarify the order of an,i and the non-random upper 
bound of s for some preliminary estimators. For a technical reason, we re-define 
the weights Wjsj and using a constant > 0 by 

1 

W\ 


= 


max 


1 

Ryj 


J- 

ItI’^ 


(3.4) 


-1 

'W 


for z G G and j G S. From these restrictions, it follows that min^.g^ > R, 

They are required to exclude the case where wpj —)■ 0 


and max^g^ w. 
and w. 


7,2 ^ Rvi 


7,2 


00 as n goes to infinity in our theory. Although these are the same 
as the originals if Ryj is large, it is convenient for deriving the following theories. 
Moreover, the same result holds if different constants Rwfi and Rw,'y are used in 


Under these conditions, we can show the algorithmic and statistical conver¬ 
gence property of Algorithm 
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Theorem 1. Assume Conditions 1-3 and (3.3). Let 

Cn 


Xf} > 2CRn 


\ogp 


A-y < 


n 


Ru, max .g^j Yli^G 


cj2 logp 


n 


for a constant C > \f2 and Rw > 0 used in (3.4). Then, for any iteration k > 1 
and any initial value in Algorithm^ 


k-l 

^ p\ 


11/3" - (3*\\2 < /II/3'™' - P*h + 2K-^V^Xg{R-^ + . 

1^0 

with probability going to one, where p = 2K,~^6max{s,g). 


(3.5) 


The first term of the right hand side of (3.5) can be regarded as the algorith¬ 
mic error. It represents the effect of an initial value in Algorithm This 
shows that if p < 1 then the effect vanishes from the bound exponentially as the 
number of iterations increases. Since |Cmaa:(Af)| < ||7W||£^ for any symmetric 
matrix M, we have 


^max ( 5 ) 9 ) ^ 


2 sg 

max XjA —. 
l<i<n;l<i<p n 


(3.6) 


Thus, if maxjj 2 = o{n), we have p < 1 for sufficiently large n with probabil¬ 
ity going to one, where an ,2 shall be defined later in Condition]^ The second term 
can be regarded as the statistical error. We notice that maxjg 5 » rcpj = 0(1) with 
probability going to one if min^gs* |/3j| = 0(1). In this case, the second term has 
0{^f^Xp), which is equivalent to the order of the standard Lasso excluding the 
term from the model (2.1) in advance. 

Theoremshows only the convergence result for the output. Next, we shall 
show that the output can recover the true support. We need an extra condition 
and a corollary. 

Condition 4. There exist a constant C > 0 and a sequence 0 ^, 2 , which may 
diverge, such that max(s, p) < Can ,2 with probability going to one. 


Corollary 1. Assume Conditions nLHl (3.3), maxjj \xij\ = 0(1), min^gg* \fl*A = 


0(1) and 2 = o{n). Let Xp > Cp{{\ogp)/n)Y/‘^ and A.y < C'^^{(logp)/n}^/2 
with some constants Op > 0 and C.y > 0. If there exist some Pq > 1 and 
Oo > 0 such that Pdl/?*™* — (3*\\2 < Cop~^°Y~s*Xp) —)• 1, then it follows that 
P(||/3^ — /3*||2 < C\/~s*Xp) —> 1 for any k > ko for some constant O > 0. 
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When maxjj \xij\ = 0(1) and 0^2 — o{n), we have p = o(l). Consequently, 
Corollary can be immediately obtained from Theorem The requirement 
_ ^*||2 < CQp~^°\fs*\p) —)• 1 is not so restrictive since p = o(l). If we 
use 0 for the initial value then the condition (a^ = 0{y/^Xp) is 

required. From Corollary we can show a theorem about the support recovery. 
It should be noted that we only need to analyze the elements of /3^ on 5 n S*" 
since the screening property S* C S holds. 

Theorem 2. Assume the same eonditions as in Corollary^ Iflogp = O(logn), 
V^ogn = o{^/nmmi^G* \li\), an, 2 s* = o{n) and a„,imax an,2S*, g*/y/n) = 
0 ( 1 ), then we have P(/3|^^^c = O) —)• 1 for any k > ko + 1, where ko is defined 
in Corollary^ 


3.3. Preliminary estimator and its properties 

In the previous section, some statistical properties are derived for general 
preliminary estimators. In this section, we introduce a concrete example and 
specify the orders and an ,2 in Conditions and Let us consider the Lasso 
type preliminary estimators of 6 = (/ 3 ^, 7 ^)^, given by 

6 = argmin - Z0||| + Ae||0||i, (3.7) 


where Ag > 0 is the tuning parameter and Z = (X, ^/nln)- The estimator 6 is 
a simple variant of that proposed in Nguyen and Tran (2013). Using different 
tuning parameters for f3 and 7 may improve the accuracy of estimates. However, 
even if the preliminary estimates do not have the high accuracy, we can improve 


its accuracy in calculating (2.2) with different tuning parameters. For this reason. 


it would be enough to use the single tuning parameter in (3.7). 


Based on the well-known analysis for the Lasso (see, e.g., Bickel et al. (2009) 
and Wainwright (2009)), the following property can be shown. 

Proposition 2. Assume Condition^and there exists a eonstant k > 0 such that 

> K, (3.8) 


mm 

0^O;||0..*c||i<3||0c,*||i 


\ze\\ 




where U* = supp(0*) with 6* = . Let Xg = C' 0 {(logmax(n,p))/n}^/^ 
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for sufficiently large Cq > 0. Then, it follows that 


\e-e*\\2 < c 


(s* + g*) log max(n, p) 


n 


|supp(0)| < Cf,raax{z'^Z /n){s* +g*), 
with probability going to one, where C > 0 is some constant. 


(3.9) 

(3.10) 


Note that the condition (3.8) is slightly different from that commonly used 


in the Lasso analysis. It is involved in the extended matrix Z not in X. For 
more details, see Nguyen and Tran (2013). Remember Conditions and The 


bounds (3.9) and (3.10) will correspond to the orders a^,! and 0 ^, 2 ) respectively. 
However, the term f,max{Z'^Z/n) may become troublesome since it may diverge 
as n or p increases. To exclude it, after 6 is obtained with Xq, we consider the 


threshold version of (3.7) with an additional tuning parameter rg > 0 as 


ef = 9jl{\ej\ > TeXe), j = 1, ■ •. ,n +p. 


(3.11) 


Now, let 0*^ = (0^^^,..., 0^^p)^ and an,i = {(s* + ( 7 *) logmax(n,p)/n)}^/^. Under 
the condition (3.3) in which C is replaced by 2(7, we have \6j\ > Can,i for any 
j G supp(0*). Thus, if we select tq such that T 0 XO < Can,i, then = 6j for any 
j G supp(0*). Hence, the threshold version also holds the property (3.9) with 
the same order as the original. Meanwhile, note that 


supp(0*^)\supp(0*)| = 


< 


E 

jSsupp(0*'*)\supp(0*) 

\e-e*\\l 


1 < 


E 


iL 


j0supp(0*) ® ® 




<C{s*+g*), 


if we select tq > Cr for sufficiently small Cr > 0, which implies that |supp(0*^)| < 
(1 + C){s* + g*). Here, the term ^rnaxiZ^Zjri) is excluded. Note that the 
conditions tqXq < Can,\ and tq > Cr are compatible since the order of Xq is 
smaller than or equal to that of 1 . 


4. Numerical performance 

We examined numerical performances of our procedure based on 100 Monte 
Carlo simulations. All of the tuning parameters A,g, A.y, Xg and tq were selected 
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- — *—^-^ 3 = 


\ 

\ 


□ Lasso(p=200) 

O Th_Lasso(p=200) 

A Lasso(p=400) 

Th_Lasso(p=400) 

n ^ I ^ I ^ ^ r 

1% 5% 10% 15% 20% 25% 30% 

Percentage of outliers 


Figure 1: The support size (left) and the coverage probability (right) for the two pre¬ 


liminary estimators. “Lasso” means (3.7) and “Th_Lasso” means (3.11). Each point on 


the curve shows the mean based on 100 Monte Carlo simulations. 


by the Bayesian information criteria (BIC; [Schwarz (1978)). For instance, let 
ns consider the selection of by BIC. Let /3(Aj3,Aj) and ■y(Ap,A^) be the 

outputs of Algorithm with the tuning parameters (A^jA.-,,)- Then, the optimal 
tuning parameters (A^, A..,,) are given by 


(A/ 3 ,A.y)= argmin — ||y - X/3(A/3, A.^) - y/n 7 (A/ 3 , A^) 

A^>0;A'y>0 


+ 


logn 


n 


{|supp(/3(A;3, A^))| -b |supp( 7 (A; 3 , A^))|}. 


Practically, since it is impossible to search all possible tuning parameters, we 
searched them over some candidate values which were generated by the similar 


way to in Friedman et al. (2010) 


The first two simulations were designed to see the impact of the number of 
outliers. The two scenarios {n,p,s*) = (200,200,10) (moderate dimension) and 
{n,p,s*) = (200,400,20) (high dimension) with various g* were considered. The 
covariates ajj’s were independently drawn from Ap(0,S) with = 0.3l*“-^l, 

the true coefficients were given by = sgn(nj) and the true outliers were given 
by \/n'y* = 8. The positions of the non-zero coefficients and outliers were uni- 
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formly drawn from {1,... ,p} and {1,..., n}, respectively. The e^’s and uj’s were 
independently drawn from A^(0,1). Moreover, we used Rw = 100 in (3.4) and 
Algorithmstopped when ||/3^ — /3^“^||i/s < 10“^ was satisfied at the iteration 
k. 


Pre. 


Outlier 

Criterion 

Soft 

Hard 

SCAD 

Garrote 

Lasso 

Oracle 


3.7 


5% 

^2-error 

0.1036 

0.1008 

0.1026 

0.1021 

1.7400 

0.3514 





FP 

1.08 

0.88 

0.97 

0.95 

66.86 

5.17 




TP 

10.00 

10.00 

10.00 

10.00 

10.00 

10.00 



10% 

^2-error 

0.1059 

0.1058 

0.0992 

0.1014 

9.4954 

- 




FP 

1.27 

1.09 

1.07 

1.11 

124.53 

- 




TP 

10.00 

10.00 

10.00 

10.00 

9.97 

- 



20% 

^2-error 

0.1828 

0.1436 

0.1377 

0.1485 

32.0149 

- 




FP 

3.26 

2.38 

2.62 

2.74 

153.42 

- 




TP 

10.00 

10.00 

10.00 

10.00 

9.76 

- 



30% 

^2-error 

0.6788 

0.4886 

0.4683 

0.5107 

52.2651 

- 




FP 

6.77 

5.04 

5.33 

5.68 

157.06 

- 




TP 

9.87 

9.89 

9.89 

9.89 

9.53 

- 

1 

3.11 

1 

5% 

^2-error 

0.1067 

0.1060 

0.1059 

0.1061 

- 

- 




FP 

1.05 

1.07 

1.07 

1.07 

- 

- 




TP 

10.00 

10.00 

10.00 

10.00 


- 



10% 

^2-error 

0.1097 

0.1080 

0.1084 

0.1083 


- 




FP 

1.18 

1.09 

1.10 

1.10 


- 




TP 

10.00 

10.00 

10.00 

10.00 


- 



20% 

^2-error 

0.1654 

0.1501 

0.1487 

0.1541 


- 




FP 

1.14 

1.11 

1.06 

1.11 


- 




TP 

10.00 

10.00 

10.00 

10.00 


- 



30% 

^2-error 

0.6583 

0.5377 

0.5502 

0.5788 


- 




FP 

3.66 

2.91 

2.87 

3.20 


- 




TP 

9.84 

9.84 

9.84 

9.84 

- 

- 


Table 1: Numerical performances of the proposed procedure for various outlier percent¬ 
ages when {n,p,s*) = (200,200,10). Each value shows the mean based on 100 Monte 
Carlo simulations. 


Figure shows the support size s + g (left) and the coverage probability 
P(S'* C S,G* C G) (right) for the two preliminary estimators (3.7) and (3.11) 
when the percentage of outliers increases from 1% to 35%. It can be seen that 
the two preliminary estimators performed well if the percentage of outliers was 


lower than around 20%. The threshold version (3.11) had a smaller support 


size than (3.7), but its coverage probability was worse. We also notice that the 
coverage probability tended to be low as the percentage of outliers increased. It 
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would come from the violation of the condition (3.3). As the number of outliers 


increases, the order 0^,1 increases. 


Pre. 

Outlier 

Criterion 

Soft 

Hard 

SCAD 

Garrote 

Lasso 

Oracle 


3.7 

5 % 

^2-error 

0.2287 

0.2139 

0.2041 

0.2087 

3.8340 

1.0337 




FP 

2.32 

2.05 

2.04 

2.09 

97.95 

21.69 



TP 

20.00 

20.00 

20.00 

20.00 

19.98 

20.00 


10 % 

^2-error 

0.3987 

0.3096 

0.3039 

0.3218 

8.5953 

- 



FP 

6.03 

5.02 

5.33 

5.30 

134.80 

- 



TP 

20.00 

20.00 

20.00 

20.00 

19.87 

- 


20 % 

^2-error 

3.9041 

3.5611 

3.5540 

3.6990 

19.5640 

- 



FP 

21.51 

18.74 

19.77 

20.44 

158.11 

- 



TP 

19.01 

18.99 

19.00 

19.09 

18.79 

- 


30% 

^2-error 

13.7451 

13.0448 

13.7788 

13.9348 

30.6065 

- 



FP 

43.13 

40.86 

42.78 

43.41 

169.16 

- 



TP 

15.85 

15.81 

15.90 

15.82 

17.52 

- 

( 

3.11 

5% 

^2-error 

0.2502 

0.2280 

0.2232 

0.2293 

- 

- 



FP 

2.17 

2.06 

1.98 

2.10 

- 

- 



TP 

20.00 

20.00 

20.00 

20.00 

- 

- 


10% 

^2-error 

0.4013 

0.3253 

0.3137 

0.3345 

- 

- 



FP 

2.91 

2.43 

2.57 

2.56 

- 

- 



TP 

19.99 

19.99 

19.99 

19.99 

- 

- 


20% 

^2-error 

4.1649 

4.0372 

3.9793 

4.0969 

- 

- 



FP 

19.78 

19.04 

18.81 

19.04 

- 

- 



TP 

18.99 

19.01 

19.00 

19.00 

- 

- 


30% 

^2-error 

13.2793 

13.9873 

13.7711 

14.3158 

- 

- 



FP 

41.10 

41.02 

41.00 

41.39 

- 

- 



TP 

16.02 

16.03 

15.99 

16.00 

- 

- 


Table 2: Numerical performances of the proposed procedure for various outlier percent¬ 
ages when {n,p,s*) = (200,400,20). 

Tables and show the squared £ 2 -error ||/3 — / 3 *|| 2 , the number of the 
false positives \{j\(3j = 0,/3j 7 ^ 0}| (FP) and the number of the true positives 
\{j\(3j / 0,f3j 7 ^ 0}| (TP) for various penalties for outlier parameters. We used 
the preliminary estimator for in Algorithm We considered the “Soft”, 
“Hard”, “SCAD”, and “Garotte” thresholding functions. Only the Soft does 
not have the redescending property. The Garrote has a different behavior from 
the Hard and the SCAD, in fact, its A) function never vanishes if z is fi¬ 
nite. For the comparison, we also investigated the performances of the standard 
“Lasso” and its “Oracle” version where the true outliers are excluded in advance. 
The symbol means that the standard Lasso does not depend on preliminary 
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estimators and its oracle version does not so on outliers additionally. 

As seen in Table for the moderate dimension, our procedure provided 
quite good estimates and recovered the true support well. Interestingly, the 
performance was better than the Oracle. This would be because the error e 
can yield extreme values in the simulation and the Oracle is not robust against 
these values. As seen in Tablehowever, for the high dimension and large outlier 
percentages, our procedure did not exhibit good performances. This result would 
come from the bad coverage probability of the preliminary estimators. Compared 
between the preliminary estimators, (3.11) performed better than (3.7) for the 
true support recovery, but the opposite was true for the ^ 2 -error. It is also noted 
that the Soft performed worse than the other thresholding functions. This would 
be explained by the redescending property. 



Magnitude of outliers 



Figure 2: The squared € 2 -error (left) and the false positives (right) for various magni¬ 
tudes of outliers. Each point on the curve shows the mean based on 100 Monte Carlo 
simulations. 

The final simulations are designed to investigate the impact of the magnitude 
of outliers. We used {n,p,s*) = (200,400,20) with g* = 20 (10% outliers) and 
various magnitudes of y/n'y* ■ We considered the situations = 7* In with 

7 * G {2,4,6,..., 14}. The Monte Carlo samples were generated by the same way 
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as above. In Figurej^ only the performances of the Soft and Hard are shown. The 
SCAD and Garotte performed similarly to the Hard. The true positives are also 
omitted since they were around 20 for all the cases considered. Our procedure 
performed well with low and high magnitudes, but it did not so with a moderate 


magnitude. This also would be come from the violation of the condition (3.3). 


For a low magnitude, the outliers would be hidden by the random errors. When 
e is drawn from A^n(0, /„), the maximum magnitude of e^’s is less than yj2 logn 
(it is around 3.3 in this case). We also note that the performances tended to be 
stable as the magnitude increased. 

5. Proofs 

5.1. Proof of Proposition 

Let (/3,7) be the output of Algorithm Then, from (2.3), we have for 

j = 1, 

1 " 

- xfp - ^/n^i) + Xp,jd\Pj\ = 0 . 

” i=i 


Since X) = z — 0(z; A), it follows from (2.4) that 


1 

n 


^Xijivi - xf$ - y/n^i) = - - xfp - e{yi - xf 

^ i=i 

1 ” 

= - - xfp-X^w^^i), 

i=l 

which ends the proof. 

5.2. Proof of Theorem [T] 

To prove the theorem we prepare the next lemma. 

Lemma 1. Let Zi,...,Z„ he independently identically distributed as the zero 
mean sub-Gaussian distribution with a parameter a > 0. Then, for any vector 
a G M"' and any t >0, 

a 




2=1 


> t < 2 exp — 




2cj2||a||2 
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Proof. Let Z = ^11=1 From the Markov’s inequality, we have for u > 0 

F(z>t^ = P(e“^ > e“*) < < exp 

^ i=l 

Note that u = t/((j^||a|| 2 ) minimizes —rtf + rt^cr^||a||2/2 over u > 0. Thus, 
F(Z > t) < exp{—t^/(2cj^||a||2)}. Similarly, F(Z < —t) < exp{—t^/(2c7^||a||2)} 
and then from P{\Z\ > t) < P{Z > t) + P{Z < —t) we obtain the lemma. □ 


— ut + 


n^cr^llal 


Since we consider the adaptive Lasso type estimator and the screening prop¬ 
erty that S* C S and G* C (5 is satisfied under Condition and (3.3), it suffices 
to focus only on the covariates selected by the preliminary estimator f3, that is, 
on the sub-matrix = {xij\ 1 < i < n;j G S'}. Correspondingly, we only 
focus on the coefficients in the set S. In this section we omit the subscript S for 
simplicity, therefore keep in mind that all of the following X, and (3* have 
the s dimension. 

We shall show the bound (3.5) on the event that Condition is satished and 


on 


E = 


-XPe 


n 




< c 


0-2 log p 


n 


(5.1) 


both of which have probabilities going to one, where X^q^-^ = {xij\ i G G^,j G Sj, 


= {si\ i G G^} and G > \/2. In fact, from Lemma and Condition 


we 


have for given the preliminary estimators (3 and 7 , 


p(f;) = 1 - = i -: 


-X^ e- 


> G 


jeS i&G^ 

>1 — 2 ^ exp(— 2 “^G^ logp) 

165 

> 1 — 2 exp(logp — 2 “^C^ logp) = 1 — 0 ( 1 ). 


logp \ 


n 


I 


Note that the lower bound does not depend on the preliminary estimators, and 
hence the probability of E goes to one. 
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Since /3^ = argmin^L(/3, 7 ^ ^), it follows that < L{f3*,'y^ ^). 

Hence, we have 


i 6 S 


= Ai+ A 2 , say, 


where = (3^ — [3*. First, we evaluate the term Ai. Since y/n'yf~^ = Q{yi — 
xf (3^~^-, X^w^^i) and = 0 when i G the j-th element of X'^{y — 

Xf3* — -v/n 7 ^“^) is given by 


'^Xijivi - xj(3* -Q{yi-xjp^ ^ Xij{yi - xj(3*) 

i&G<^ 

'^Xij{xJ+ {yi - - Q{yi - xjl3^-^]\^w^^i)] 


i&G 


i&G 


+ Xij{yi - xj(3*). 

i&G<^ 


Thus, we obtain 
1 




j&S i&G 

+ - ^ ^ Xij{yi - xj(3*) = An + Au + Ais, say. 

j£S i£G‘^ 

Since ||A^||o < s for any k > 1, the definition of the (doubly) restricted largest 
eigenvalue in (3.2) implies that 


lAlllI < i||X(^)A'=-'||2||X(g)A''||2 < dn,ax{~S,m^' 

Note that |0(x; A) — x| < A under Condition]^ Then, 


I A I ^ RwX'y II 


A''||i < C 


logPii * fc. 


n 


(5.2) 


(5.3) 


where Rw > 0 is defined in (3.4). Since G* C G, we have 7 * = 0 for i G G‘^. 


Then, from (5.1), we have 


1 


l^isl < I|A^(g7^G'=II°oII^ lb — 


A'^lh < G 


n 




n 


(5.4) 
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For the term A 2 , since S* C S, 


-42 < ^ + Af I - A^ ^ WfjW' + A 

ies* j&s* j(zs*’= 

< A/3 ^ ^ 


ies* jes 

. /c 


< A/3maxR;y3,il|A|*||i - ^||A^*c||i 


Combined with (5.2)-(5.4), it follows from ||A^||i = ||Ag*||i + ||A 




that 


2n 


<5max{s,. 


|A^-i||2l|A'=||2 + 2C 


logPii . fcl 


+ A^maxR;^j||A|*||i - ^||A^ 
j&S* Kw 


n 

s*=lll 


— ^max (5), 


+ 2C 


. fc-ii 


|2||A'‘||2+ \ \/3inaxwi3j + 2C 


jes* 


a 


■ logp \ 


n 


|a|*||i 


<7^ logp _ jV 

R'in 


lA^ 


n 


‘S*=lll 


< 6max{s,g)\\X^ ^||2||A''||2 + \p(Rj +maxrc;3j)\/^||A^||2 

jes* 


From Conditionand (3.1), we have ^HACA^lH > «:||A^||2. Thus, for A: > 1, 

|| A *^||2 < p || A *^“^||2 + 2 K ~^ y /^ Xi 3 [ R ~^ + maKWisj ). 

jGS* 


Let A^ = — (3*. Then, the bound (3.5) is derived by solving the above 

recurrence relation for /c = 1, 2 ,..., which ends the proof. 


5.3. Proof of Theorem [2] 

Throughout this section we denote positive constants by Cj (i > 1) which 
may be different from each other. Suppose that for some £ G 5 n S*'", 7^ 0. 

Without loss of generality, we can assume > 0. Then, from the hrst order 
condition for (3^, the value 

1 n 1 " r 

j&S *=1 ^ ie5 


I + Xjswis^e 


(5.5) 
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should be zero. But, if we can show the first two terms are dominated by the 
third term for any ^ G 5 H S*", then the above value can never be zero, 

which leads to the contradiction. 


First, we evaluate the middle term of ( 5 . 5 ). From the dehnition, the inside 
of 0 is represented as 


i6S 


V^li + ^ )’ i € G* CiG — G* 

G - - /?;), i G n G. 


By Lemmawe can show that maxi<j<„ |ei| < Gl^J\ogn with probability going 
to one. Let A/j = C' 2 {(logp)/n}^/^, then it follows from Corollary and an, 2 S* = 
o{n) that 


max 

l< 2 <n 


ei 


j^s 



< max |ej| + G^VsWP^ ^ 

l<i<n 


p* 


< C' 4 v/logn(l + o(l)), 


which implies that if A..,, min.g(^*C|_|^ > G 4 ^/logn, then yi — Ylj^s = 0 

for i G G*'' n G. Note that min^gg*c= Rw for sufficiently large n since 
min^gg,.C|_|g,(l/| 7 j|) > l/{Gan,i) —)• oo. Hence it suffices to put \.y = Cs-^/logn 
for a sufficiently large C 5 > 0 . Such a A..,, satisfying the condition of Corollary 
can be selected since a ‘^2 — o{n). Meanwhile, since \/\ogn = o{^/nT^l\n.i^G* l 7 i*l)) 
we have yi — + o(l)) for i ^ G*. Thus, for sufficiently 

large n, it holds that minigG* lUi — Ylj^s > A.y maxjgG* 

Therefore, under Condition 


j&s 


Vi - ^ + 0 {^J\ogn), i G G* 

0 , i G G*' n G. 
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Then, it follows that 


^ ^ ‘I Vi ®{yi ^ ^ ^ij /^7 )r 

^ ieS 


i&G 


y~l Vi - XijPj ^ 


16S 


ieG= 


^ Xi£ ^ Xij/3^ 1 ^ + X] a;i£yi + ^ Xj^yi 

*eG* jg5 i£G* ieG*‘^nG jeG= 

^ Xi£^Xij{/3* + - ^j)} + X] + 0(g*^/logn) 

»eG* jgs 


ieG* 


Xi£Xij/3* + ^ '^Xi£Xi0^j ^ - ^*j)+ XiiEi + 0(5Vlogn). 

i=ljgs ieG* jg5 


ieG* 


Thus, (5.5) can be written as 


i=ljgs ieG* jg5 

'g* 


- ^ Xj^e* + of — v^logpj +\i3Wp^i. 


Clearly the fist and second terms have the order ■^an, 2 S*\p- From the proof of 
Theoremj^ the third term is of order {(logp)/n}^/^. By an^i max {^an, 2 S*, g* j y/n) 
o(l), the first four terms of (5.5) are dominated by min^.g^p|_j*c since 
min^.g_5i-^^*c > l/(C'a„,i). 
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